{
"nbformat": 4,
"nbformat_minor": 4,
"metadata": {
"title": "Shuffle vs. Resample",
"kernelspec": {
"name": "ir",
"display_name": "R",
"language": "R"
},
"language_info": {
"name": "R",
"codemirror_mode": "r",
"pygments_lexer": "r",
"mimetype": "text/x-r-source",
"file_extension": ".r",
"version": "4.0.3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I created this notebook to help us play around with the concepts of shuffle and resample."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 58,
"source": [
"# This code will load the R packages we will use\n",
"\n",
"suppressPackageStartupMessages({\n",
" library(mosaic)\n",
" library(supernova)\n",
" library(ggpubr)\n",
" #library(Lock5withR)\n",
" #library(fivethirtyeight)\n",
"})\n",
"\n",
"font_size = function (size) {\n",
" theme(text = element_text(size = size))}\n",
"\n",
"\n",
"middle <- function(x, prop = .95) {\n",
" sorted <- sort(x)\n",
" tail_size <- (1 - prop) / 2\n",
" upper_cut <- sorted[floor(length(x) * (1 - tail_size))]\n",
" lower_cut <- sorted[ceiling(length(x) * tail_size)]\n",
" ((x <= upper_cut) + (x > lower_cut)) == 2\n",
"}\n",
"\n",
"\n",
"# Adjust the plots to have nice sizes\n",
"options(repr.plot.width = 6, repr.plot.height = 4)\n",
"\n",
""
],
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An Extreme Data Set\n",
"\n",
"It may be easier to appreciate what `shuffle()` is doing with a more extreme example rather than a more realistic data set!\n",
"\n",
"Imagine two groups, A and B. All cases in group A have an outcome of 0 and all those in B have an outcome of 10. For now let's assume there are 50 cases in each group (but you can play around and set `n` to anything you want)."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 40,
"source": [
"n <- 50\n",
"\n",
"group <- c(rep(\"A\", n), rep(\"B\",n))\n",
"outcome <- c(rep(0, n), rep(10,n))\n",
"extremedata <- data.frame(outcome, group)\n",
"\n",
"head(extremedata, 2*n)"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 40,
"metadata": {},
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" | \n",
" outcome | \n",
" group | \n",
"
\n",
" \n",
" | \n",
" <dbl> | \n",
" <chr> | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 6 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 7 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 8 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 9 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 10 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 11 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 12 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 13 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 14 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 15 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 16 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 17 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 18 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 19 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 20 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 21 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 22 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 23 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 24 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 25 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 26 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 27 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 28 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 29 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" 30 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
" ⋮ | \n",
" ⋮ | \n",
" ⋮ | \n",
"
\n",
" \n",
" 71 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 72 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 73 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 74 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 75 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 76 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 77 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 78 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 79 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 80 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 81 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 82 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 83 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 84 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 85 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 86 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 87 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 88 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 89 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 90 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 91 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 92 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 93 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 94 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 95 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 96 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 97 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 98 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 99 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
" 100 | \n",
" 10 | \n",
" B | \n",
"
\n",
" \n",
"
"
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linking _randomness_ and `shuffle()`\n",
"\n",
"We're trying to link together \"randomly generated data\" to `shuffle()` to having 0 difference between groups (represented as $\\beta_1 = 0$ and ultimately $Y_i = \\beta_0 + \\epsilon_i$). Let's take a look at why.\n",
"\n",
"Our groups start off very extreme. Group perfectly predicts outcome. If you knew a case was in group A, you would predict they would have an outcome of 0 and there would be no error.\n",
"\n",
"Indeed, if you look at the `favstats` you'll see that the means are 0 and 10 (and there is no spread at all within groups). In this case, the data have a very large $b_1$! That is, $b_1 = 10$!"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 72,
"source": [
"gf_histogram(~ outcome, data = extremedata) %>%\n",
"gf_facet_grid(group ~ .) %>%\n",
"gf_vline(xintercept = ~mean, data = favstats(outcome ~ group, data = extremedata))\n",
"\n",
"favstats(outcome ~ group, data = extremedata)\n",
"\n",
"print(\"This is the sample's b1:\")\n",
"b1(outcome ~ group, data = extremedata)"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 72,
"metadata": {},
"data": {
"text/html": [
"\n",
" \n",
" \n",
" group | \n",
" min | \n",
" Q1 | \n",
" median | \n",
" Q3 | \n",
" max | \n",
" mean | \n",
" sd | \n",
" n | \n",
" missing | \n",
"
\n",
" \n",
" <chr> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <dbl> | \n",
" <int> | \n",
" <int> | \n",
"
\n",
" \n",
" \n",
" \n",
" A | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 50 | \n",
" 0 | \n",
"
\n",
" \n",
" B | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 10 | \n",
" 0 | \n",
" 50 | \n",
" 0 | \n",
"
\n",
" \n",
"
"
]
}
},
{
"output_type": "stream",
"name": "stdout",
"text": "[1] \"This is the sample's b1:\"\n"
},
{
"output_type": "execute_result",
"execution_count": 72,
"metadata": {},
"data": {
"text/html": [
"10"
]
}
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What would happen if we shuffled these groups? (I'll color the shuffled data in `dodgerblue`.)"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 73,
"source": [
"gf_histogram(~ shuffle(outcome), data = extremedata, fill = \"dodgerblue\") %>%\n",
"gf_facet_grid(group ~ .) "
],
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that because the groups are randomly generated, A has an equal chance of being associated with a high or low score (and so does B). So now take a look at the shuffled $b_1$! Observe how small these shuffled $b_1$s are. (Run the code below a few times.)"
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 71,
"source": [
"extremedata$shuff_outcome <- shuffle(extremedata$outcome)\n",
"\n",
"gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n",
"gf_facet_grid(group ~ .) %>%\n",
"gf_vline(xintercept = ~mean, data = favstats(shuff_outcome ~ group, data = extremedata))\n",
"\n",
"print(\"This is a shuffled b1:\")\n",
"b1(shuff_outcome ~ group, data = extremedata)"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "[1] \"This is a shuffled b1:\"\n"
},
{
"output_type": "execute_result",
"execution_count": 71,
"metadata": {},
"data": {
"text/html": [
"1.6"
]
}
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shuffled $b_1$s end up being smaller because the randomized groups are very similar to each other now! So we characterize the **random process (of shuffle)** as a DGP where $\\beta_1=0$, the true difference is 0. And now, many of the shuffled data look very much like they could have been generated by the empty model (the grand mean of the whole data set -- depicted in `blue`)."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 70,
"source": [
"extremedata$shuff_outcome <- shuffle(extremedata$outcome)\n",
"\n",
"gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n",
"gf_facet_grid(group ~ .) %>%\n",
"gf_vline(xintercept = ~mean, data = favstats(shuff_outcome ~ group, data = extremedata)) %>%\n",
"gf_vline(xintercept = ~mean, data = favstats(~shuff_outcome, data = extremedata), color = \"blue\")\n",
""
],
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linking _DGP is basically just like our data_ and `resample()` (resampling our cases)\n",
"\n",
"We're also trying to link together \"DGP is like our data\" to `resample()` to saying as $\\beta_1 = \\b_1 = 10$ and ultimately $Y_i = \\beta_0 + 10X_i + \\epsilon_i$). Let's take a look at why.\n",
"\n",
"When we resample (depicted in `tomato`), we get a very similar plot to our original data."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 69,
"source": [
"gf_histogram(~ outcome, data = resample(extremedata), fill = \"tomato\") %>%\n",
"gf_facet_grid(group ~ .) "
],
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make it a bit easier to see, here we've put the original extreme data (in gray), the resampled data (in `tomato`), and shuffled data (in `dodgerblue`) next to each other."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 76,
"source": [
"originalplot <- gf_histogram(~ outcome, data = extremedata) %>%\n",
"gf_facet_grid(group ~ .) \n",
"\n",
"resampleplot <- gf_histogram(~ outcome, data = resample(extremedata), fill = \"tomato\") %>%\n",
"gf_facet_grid(group ~ .) \n",
"\n",
"shuffleplot <- gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n",
"gf_facet_grid(group ~ .) \n",
"\n",
"ggarrange(originalplot, resampleplot, shuffleplot, \n",
" labels = c(\"Original\", \"Resample\", \"Shuffle\"),\n",
" ncol = 3)"
],
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that because the groups are made up of resampled cases, resampled group A is still made up of cases that used to be in the original group A (and the same with group B). So now take a look at the resampled $b_1$! Indeed, if you run the code a few times, the resampled $b_1$s are the same as our original $b_1$."
]
},
{
"cell_type": "code",
"metadata": {
"trusted": false
},
"execution_count": 75,
"source": [
"resampledata <- resample(extremedata)\n",
"\n",
"gf_histogram(~ outcome, data = resampledata, fill = \"tomato\") %>%\n",
"gf_facet_grid(group ~ .) %>%\n",
"gf_vline(xintercept = ~mean, data = favstats(outcome ~ group, data = resampledata))\n",
"\n",
"print(\"This is a resampled b1:\")\n",
"b1(outcome ~ group, data = resampledata)\n",
""
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "[1] \"This is a resampled b1:\"\n"
},
{
"output_type": "execute_result",
"execution_count": 75,
"metadata": {},
"data": {
"text/html": [
"10"
]
}
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"image/png": [
""
]
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So to recap resample, we characterize the **DGP like our sample** as a DGP where $\\beta_1=b_1$, the true difference is the same as that of our sample (10). And now, many of the resampled data look very much like they could have been generated by the best fitting complex model model (the two group means).\n",
"\n",
"## A note about wording\n",
"\n",
"For our purposes, it's easier to call it `shuffle()` versus `resample()` because those are the R functions we use. In data science and statistics, they call `shuffle()`!number(0)**RANDOMIZATION** and they call `resample()`!number(0)**BOOTSTRAPPING**."
]
}
]
}